High-performance computing platforms such as supercomputers have traditionally been designed to meet the compute demands of scientific applications. Consequently, they have been architected as producers, not consumers, of data. The Apache Hadoop ecosystem has evolved to meet the requirements of data-processing applications and has addressed many of the limitations of HPC platforms. There exists a class of scientific applications, however, that needs the collective capabilities of traditional high-performance computing environments and the Apache Hadoop ecosystem. For example, the scientific domains of biomolecular dynamics, genomics, and network science need to couple traditional computing with Hadoop/Spark-based analysis. We investigate the critical question of how to present the capabilities of both computing environments to such scientific applications. Whereas this question needs answers at multiple levels, we focus on the design of resource-management middleware that might support the needs of both. We propose extensions to the Pilot-Abstraction to provide a unifying resource-management layer. This is an important step that allows applications to integrate HPC stages (e.g., simulations) with data analytics. Many supercomputing centers have started to officially support Hadoop environments, either in dedicated environments or in hybrid deployments using tools such as myHadoop. This typically involves many intrinsic, environment-specific details that need to be mastered and that often swamp conceptual issues such as: How best to couple HPC and Hadoop application stages? How to explore runtime trade-offs (data locality vs. data movement)? This paper provides both conceptual understanding and practical solutions to the integrated use of HPC and Hadoop environments.
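
As a rough illustration of what such a unifying resource-management layer looks like from the application's side, the sketch below couples an HPC-style simulation stage with a Hadoop/Spark-style analytics stage under a single pilot-like object. This is a minimal, self-contained toy, not the Pilot-Abstraction's actual API: the names UnifiedPilot, Task, and submit are hypothetical placeholders, and the "backends" are simulated with local subprocesses.

```python
"""Toy sketch of a unifying pilot-style layer spanning HPC and Hadoop/Spark.
All class and method names (UnifiedPilot, Task, submit) are hypothetical."""
import subprocess
from dataclasses import dataclass, field


@dataclass
class Task:
    command: list                                  # executable plus arguments
    backend: str                                   # "hpc" (simulation) or "spark" (analytics)
    inputs: list = field(default_factory=list)     # data dependencies of this stage


class UnifiedPilot:
    """Stand-in for a pilot that holds both HPC nodes and Spark executors."""

    def __init__(self, hpc_nodes=4, spark_executors=4):
        self.hpc_nodes = hpc_nodes
        self.spark_executors = spark_executors

    def submit(self, task: Task) -> subprocess.Popen:
        # A real implementation would launch "hpc" tasks via the HPC allocation
        # (e.g., mpirun/srun) and hand "spark" tasks to executors placed near the
        # simulation output, trading off data locality against data movement.
        # Here both paths simply run the command locally.
        return subprocess.Popen(task.command)


if __name__ == "__main__":
    pilot = UnifiedPilot()
    sim = Task(command=["echo", "running MD simulation stage"], backend="hpc")
    ana = Task(command=["echo", "running Spark analysis stage"], backend="spark",
               inputs=["trajectory.out"])
    # The simulation stage must complete before the analytics stage consumes its output.
    pilot.submit(sim).wait()
    pilot.submit(ana).wait()
```

The point of the sketch is only the control flow: one handle over heterogeneous resources lets the application express "simulation, then analysis" without managing two separate scheduling systems or hand-coding the data hand-off between them.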